ABSTRACT


Massive Open Online Courses (MOOCs) have become more 
and more popular recently. These courses have attracted a 
large number of students world-wide. In a popular course, 
there may be thousands of students. Such a large number 
of students in one course makes it infeasible for the instruc- 
tors to grade all the submissions. Peer assessment is thus an 
effective paradigm that can help grade the submissions at a 
large scale. However, due to the variance in the ability and 
standard of the student graders, peer grades may be noisy 
and biased. Aggregating peer grades to have an accurate 
and fair final grade for a submission is a challenging prob- 
lem because the reliability and bias degrees of graders are 
usually unknown in practice. To address this issue, some
probabilistic models that consider the graders’ reliability and
bias have been proposed. However, due to the sparsity of peer grade
observations, it is difficult for these models to accurately
estimate the reliability and bias of the graders as well as the true
grades of the submissions. Compared with absolute peer
grades, the relative peer grades, derived from the difference 
between the peer grades of two submissions graded by the 
same grader, are less sparse and more robust to the grader’s 
bias. Thus relative peer grades are informative and helpful 
in cardinal peer grading estimation whose goal is to estimate 
the absolute numeric grades of submissions. In this paper, 
we propose two new probabilistic models to help improve the 
accuracy of cardinal peer grading estimation using the ob- 
served relative grades among submissions. In this way, the 
relation between the true grades among submissions is taken 
into consideration when deriving the final grades. Experi- 
mental results on real MOOC peer grading datasets show 
that the proposed models outperform baselines and the re- 
lation of true grades among submissions indeed contributes 
to the improvement in the grade estimation. 


Keywords 
Peer grading, relative peer grades, MOOCs 


Tianqi Wang, Qi Li, Jing Gao, Xia Jing and Jie Tang "Improving 
Peer Assessment Accuracy by Incorporating Relative Peer 
Grades" In: Proceedings of The 12th International Conference on 
Educational Data Mining (EDM 2019), Collin F. Lynch, Agathe 
Merceron, Michel Desmarais, & Roger Nkambou (eds.) 2019, pp. 
450 - 455 


Proceedings of The 12th International Conference on Educational Data Mining (EDM 2019) 


Qi Li 


University of Illinois 
qliz2@buffalo.edu 


Jing Gao 
University at Buffalo 


jing@buffalo.edu 


Jie Tang 
Tsinghua University 


jery.tang@gmail.com 


1. INTRODUCTION 


Massive Open Online Courses (MOOCs) have provided mil- 
lions of learners with open access to high quality courses 
via web. For a popular course, there may be thousands of 
students. Recently, several MOOC platforms offer verified 
certificates or even degree programs, and peer grading plays 
an important role in the student performance evaluation. 
The benefit of peer grading is twofold. On the one hand,
it helps instructors evaluate students’ performance,
which is otherwise infeasible due to the large
enrollment. On the other hand, it also benefits the
students: they can see peers’ work from different aspects
and increase their involvement in the course [5]. In particular,
peer grading can be used when automatic grading cannot
be applied, for example, to essays and projects. A typical
process of peer assessment includes two steps: first, students 
are assigned to grade a subset of submissions and then the 
platform aggregates these peer grades to compute the final 
grades of these submissions. 


Although peer grading is helpful, it is a challenging problem 
to aggregate these peer grades and determine the final grade 
of a submission. In this paper, we consider the case of cardi- 
nal peer grading (i.e., each submission receives a numerical 
grade as the final grade). Most platforms use the median 
of received peer grades as the final grade of a submission. 
However, the median grade may be inaccurate due to the 
different reliability and bias degrees of graders. Usually, the
difference between the grade given by a grader and the true
grade of the submission can be decomposed into a bias and
a reliability degree. Suppose a grader grades multiple
submissions. The bias is the difference between the mean of
the grades this grader gives and the mean of the true grades
of these submissions. The reliability degree of the grader is
measured by the variance of the difference between the grades
that the grader gives and the true grades of these submissions.
If this variance is large, as when a grader assigns grades at
random, he/she is not a reliable grader. If the variance is small,
the grader grades the submissions in a consistent way and is
thus a reliable grader. Modeling grader bias and reliability is
therefore important for deriving more accurate estimates of the
final grades, and there are existing efforts in this direction [7].
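
The decomposition described above can be sketched in a few lines of Python. This is only an illustration under the assumption that the true grades are known; the function name and the example grades are hypothetical and do not come from the dataset used in this paper.

```python
from statistics import mean, pvariance

def grader_bias_and_variance(given, true):
    """Return (bias, variance) of one grader's errors against the true grades."""
    errors = [z - s for z, s in zip(given, true)]
    bias = mean(errors)           # systematic inflation (+) or deflation (-)
    variance = pvariance(errors)  # spread of the errors around that bias
    return bias, variance

# Hypothetical grader who is consistently one point too generous.
bias, variance = grader_bias_and_variance([7, 9, 5], [6, 8, 4])
```

A grader with a large bias but zero variance, as in this example, is still reliable in the sense used here: the errors are perfectly consistent.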


However, because each student grades only a small subset of
submissions, peer grading leads to a data sparsity issue. The
sparsity of the observed grades makes it difficult for these
models to correctly estimate the reliability and bias of the
grader and the true grades of the submissions. In addition,
the observed grades are sensitive to the grader’s bias.
Compared with absolute observed grades, the relative peer 
grades between two submissions are less sparse and more 
robust to the grader’s bias, since the relative peer grades 
are derived from the difference of the grades assigned by the 
same grader to two different submissions. Thus the relative 
peer grades are informative in estimating the true grades 
of submissions. However, all existing cardinal peer grading 
estimation models [7, 6, 2] only consider the absolute peer 
grades of each submission. None of these models considers 
the relative grades between two submissions. 


Recognizing the importance of relative peer grades, we de- 
velop new probabilistic graphical models by leveraging rel- 
ative peer grades between submissions to model the depen- 
dency between the true grades. The proposed probabilistic 
models estimate the true grades of submissions from the 
peer grades as well as relative peer grades by modeling the 
bias and reliability of graders. Gaussian distributions are 
applied to model the true grades, the bias of grader, the 
absolute peer grades, and relative peer grades in the pro- 
posed models. Two different distributions are proposed to 
estimate the reliability of the graders. In the first model, 
the reliability of the grader follows a Gamma distribution 
with the shape parameter determined by the grader’s own 
true grade, while in the second model, it follows a Gaus- 
sian distribution with the mean equal to the grader’s true 
grade. To evaluate the proposed models, experiments are 
conducted on peer grading datasets collected from a pop- 
ular MOOC platform in China. Experimental results show 
that the proposed models improve the accuracy of the cardi- 
nal peer grading estimation by considering the dependency 
of true scores between two submissions. The main contribu- 
tions of this paper are summarized as follows: 


• We find that relative peer grades among submissions
can help improve cardinal peer grading estimation
accuracy.


• We propose new probabilistic graphical models that
incorporate observed relative grades to model the
dependency between the true grades of pairs of
submissions.


• We evaluate the proposed models on real peer grading
datasets, and experimental results show that the
proposed models can improve the accuracy of cardinal
peer grading estimation.


2. RELATED WORK 


Existing work on peer assessment aggregation can be di- 
vided into two categories based on the data types: the car- 
dinal and ordinal peer grade estimation. The goal of ordinal 
peer grade estimation is to rank the students according to 
their submissions. Models based on pair comparison [10, 8], 
Bayesian generative approach [12] and matrix factorization 
are developed for the ordinal peer grades estimation [1]. 


For cardinal peer grading estimation, students are asked to 
grade their peers’ submissions by assigning a specific nu- 
merical grade and the aim of cardinal grades estimation is 
to find the absolute true scores of the submissions. Below we
summarize the existing work related to cardinal peer grading
estimation.


One major approach to cardinal peer grading estimation is to
update grades and grader weights iteratively [4, 12, 3]. Another
major category of methods is based on probabilistic
graphical models [7, 6, 2]. The models proposed in this paper
fall into this category. The main idea is to model the true
grade of a submission and the reliability and bias of each grader
as hidden random variables following certain distributions,
and to infer the model parameters by fitting the models to the
observed peer grades. In particular, the methods in [7, 6]
(referred to as PG1 to PG5) are the most relevant to our
proposed models. In [7], three probabilistic graphical models
named PG1, PG2 and PG3 are proposed. PG1 is the
basic model, which assumes that true grades, observed peer
grades, and biases follow Gaussian distributions and that the
reliability of the grader follows a Gamma distribution. Building on
PG1, PG2 links the bias of a grader across assignments,
and PG3 couples the grade of the grader’s own submission
with the grader’s reliability. In PG3, the grader’s reliability
is modeled as a linear function of the grader’s grade. To
relax this assumption of a linear relationship, two extensions
of PG3, referred to as PG4 and PG5, are proposed in [6].
Both PG4 and PG5 assume that the reliability of a grader is
related to the grader’s own grade, and use a Gamma
distribution (PG4) or a Gaussian distribution (PG5) to model this
reliability. Recently, social connections have also been considered in
modeling the dependencies of bias among students [2].


However, all existing cardinal peer grading estimation meth- 
ods only consider absolute grades. In these methods, the 
true grades of different submissions are treated indepen- 
dently. None of these models takes the relative grades into 
consideration. In fact, leveraging the relative grades be- 
tween submissions to model the dependency between true 
grades of these two submissions can help reduce the noise 
introduced by the bias of graders and alleviate the data spar- 
sity issue, and thus can help to improve the accuracy of 
cardinal peer grading estimation. To the best of our knowl- 
edge, this is the first work that integrates relative grades 
into cardinal peer grading aggregation to achieve improved 
estimation. 


3. PROBLEM DEFINITION 


In this section, we first introduce some concepts and nota- 
tions used in the rest of this paper. Then we formally define 
the problem. 


The set of all the students is denoted as $S$ and the set of
all the graders is denoted as $G$. Under the peer grading
setting, $G \subset S$, since the graders are students as well. The
observed absolute grade (peer grade) of a submission
submitted by student $i$ and graded by grader $g$ is denoted as $z_i^g$, and
the observed relative grade (relative peer grade) between the
submissions submitted by students $i$ and $j$ and graded by grader
$g$ is denoted as $d_{ij}^g$. The relative peer grades are derived
from the absolute peer grades as the difference of the
absolute peer grades given by the same grader, $d_{ij}^g = z_i^g - z_j^g$.
For example, if a grader $g$ assigns a
score of 4 to the submission submitted by student $i$ and a
score of 6 to the submission submitted by student $j$, then
$z_i^g$ is 4 and $z_j^g$ is 6. We can derive that the relative grade
$d_{ji}^g = z_j^g - z_i^g = 6 - 4 = 2$. The subset of students whose
submissions are graded by an arbitrary grader $g \in G$ is
denoted as $S_g$, and the set of graders who assign grades to
the submission submitted by student $i$ is denoted as $G_i$.
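
As a minimal sketch of how relative peer grades follow from absolute ones, the snippet below builds them per grader. The dictionary layout and the function name are assumptions made for illustration, and keeping every ordered pair is one possible convention rather than something the text prescribes.

```python
from itertools import permutations

def relative_grades(peer_grades):
    """peer_grades maps grader -> {student: absolute grade}.

    Returns {(grader, i, j): z_i^g - z_j^g} for every ordered pair i != j
    of submissions graded by the same grader.
    """
    rel = {}
    for g, grades in peer_grades.items():
        for i, j in permutations(grades, 2):
            rel[(g, i, j)] = grades[i] - grades[j]
    return rel

# The worked example from the text: grader g gives 4 to i and 6 to j.
rel = relative_grades({"g": {"i": 4, "j": 6}})
```

A grader who grades n submissions yields n absolute grades but n(n-1) ordered differences, which is one way to see why relative grades are less sparse.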


With these definitions introduced, we define the cardinal
peer grading estimation problem as follows: Given a set of
students $S$, a set of graders $G$, a set of peer grades
$\{z_i^g\}_{g \in G,\, i \in S_g}$ and relative peer grades
$\{d_{ij}^g\}_{g \in G,\, i, j \in S_g}$, we want to estimate
the true absolute grade of the submission submitted by student
$i$, $\forall i \in S$, and to learn the reliability and bias of each grader
$g$, $\forall g \in G$.


4. METHODOLOGY 


In this section, we describe our probabilistic graphical
models, named PG6 and PG7, for cardinal peer grading
estimation. Both models specify a two-stage generation for the
peer grades and relative peer grades. The first stage specifies 
the generation of graders’ bias, reliability and true scores of 
submissions and the second stage generates the peer grades 
and relative peer grades given the grader’s bias, reliability 
as well as the true scores of submissions. 


True score generation: In the proposed models, the true 
score of the submission submitted by student i is modeled 
as a random variable following a Gaussian distribution. 


Grader bias generation: The bias of grader $g$ is denoted
as $b_g$, which measures the constant grade inflation or deflation
of a grader. We model the grader’s bias as a random
variable following a Gaussian distribution. Though different
graders may have different biases, we assume the average
of all graders’ biases is 0.


Grader reliability generation: The reliability of a grader
reflects how consistently the grader assigns grades. A reliable
grader keeps a stable bias when assigning grades to different
submissions. Following the assumptions in [11], we assume
that the reliability of a grader is related to his/her
own grade, which reflects the grader’s knowledge about the
assignment. We assume that a grader with a higher grade
on the assignment may be a more reliable grader for
submissions of the same assignment. The reliability of a grader $g$ is
denoted as $\tau_g$ and modeled as a random variable following a
Gamma distribution in the PG6 model and a Gaussian
distribution in the PG7 model. In the PG6 model,
the grader’s true grade is used as the shape parameter of
the Gamma distribution, while in the PG7 model it is used
as the mean of the Gaussian distribution.


Peer grade generation: After generating the bias and 
reliability of graders as well as the true scores, the peer 
grades can be generated with these variables. The peer grade 
is modeled as a variable following a Gaussian distribution 
whose mean is the sum of the true grade of the submission 
and the bias of the grader, and its variance is inversely
proportional to the reliability of the grader. In the PG7 model,
we introduce a hyper-parameter $\lambda$ to tune the scale of the
variance.


Relative peer grade generation: To incorporate more
observations for estimating the reliability and bias of the grader
and the true grade of the submission, the relative peer grade
is also generated. The relative peer grade provides
another view of the true score of a submission, in addition
to the traditional view that relates the true grade to the
observed peer grade and the bias of the grader. With
the relative peer grade, the true grade $s_i$ of submission $i$ can
be estimated by the sum of the true grade $s_j$ of submission $j$
and the relative peer grade between these two submissions.
In this way, the influence of grader bias is excluded.


Figure 1: The plate notation of the PG6 and PG7 models.


Table 1: Notations

Notation      Description
$S$           set of all students
$G$           set of all graders
$\tau_g$      reliability of grader $g$
$b_g$         bias of grader $g$
$s_i$         true grade of the submission from student $i$
$z_i^g$       observed grade of the submission from student $i$ by grader $g$
$d_{ij}^g$    observed grade difference between the submissions from students $i$ and $j$ by grader $g$


Similarly to the generation process of the peer grade, the relative
peer grade is generated given the true grades of the
two submissions and the reliability of the grader. We assume
the relative peer grade follows a Gaussian distribution with
mean equal to the difference of the true grades of the
two submissions and variance inversely proportional
to the grader’s reliability. As with the peer grade, in the PG7 model $\lambda$ is used
to specify the scale of the variance.
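
A short derivation illustrates why the grader's bias drops out of the relative peer grade. It assumes that the noise terms of one grader on different submissions are independent; the noise symbols $\epsilon_i^g$ and the shorthand $\sigma_g^2$ for the peer-grade variance are introduced here for illustration only.

```latex
\begin{align*}
z_i^g &= s_i + b_g + \epsilon_i^g, \qquad \epsilon_i^g \sim \mathcal{N}(0, \sigma_g^2),\\
d_{ij}^g &= z_i^g - z_j^g = (s_i - s_j) + (\epsilon_i^g - \epsilon_j^g),\\
\mathbb{E}[d_{ij}^g] &= s_i - s_j, \qquad \operatorname{Var}[d_{ij}^g] = 2\sigma_g^2 .
\end{align*}
```

Under this assumption the mean of the relative grade depends only on the two true grades, and its variance is twice the peer-grade variance, so it remains inversely proportional to the grader's reliability.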


Figure 1 shows the graphical structure of the PG6 and PG7
models. The box in the middle indicates a grader $g$, and the
first and last boxes indicate students $i$ and $j$ whose submissions
are graded by grader $g$. Table 1 summarizes the notation.


In the PG6 and PG7 models, the grader’s reliability
$\tau_g$ and bias $b_g$ and the submission’s true grade $s_i$ are the
latent variables that need to be estimated. However, these
latent variables depend on each other. To estimate
them, Gibbs sampling is applied in this
work to draw samples of each latent variable from an
approximate posterior distribution. After enough iterations, we
discard the first few burn-in iterations and use the mean
of the sampled $s_i$ as the final estimate of the true score
of submission $i$. For $s_i$ in PG6 and $\tau_g$ in PG7, we cannot
find a closed form of the posterior distribution, so we use
a discrete approximation to obtain the approximate posterior
distribution of these two variables. Next we describe the
details of the generation process and the inference of the PG6
and PG7 models separately.
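
The discrete approximation mentioned above can be sketched as follows: evaluate an unnormalised log-density on a grid and sample a grid point with probability proportional to its weight. The function name, its parameters and the toy target density are assumptions for illustration; the actual conditional densities of the models are not reproduced here.

```python
import math
import random

def sample_on_grid(log_density, lo=0.0, hi=15.0, step=0.1, rng=random):
    """Draw one value from an unnormalised log-density evaluated on a grid."""
    n = int(round((hi - lo) / step)) + 1
    grid = [lo + k * step for k in range(n)]
    logp = [log_density(x) for x in grid]
    top = max(logp)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in logp]
    return rng.choices(grid, weights=weights, k=1)[0]

# Hypothetical target: a Gaussian-shaped log-density centred at 7.
rng = random.Random(0)
draws = [sample_on_grid(lambda x: -2.0 * (x - 7.0) ** 2, rng=rng)
         for _ in range(2000)]
```

Working in log space and subtracting the maximum avoids underflow when the density is sharply peaked, which matters once many observations enter the conditional.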


4.1 The PG6 Model


The generative process of the PG6 model is as follows:


• For each submission submitted by student $i$
  - Draw true grade $s_i \sim \mathcal{N}(\mu, \frac{1}{\gamma})$

• For each grader $g$
  - Draw bias $b_g \sim \mathcal{N}(0, \frac{1}{\eta})$
  - Draw reliability $\tau_g \sim \Gamma(s_g, \beta)$

• For each peer grade $z_i^g$ of the submission submitted by student $i$ and graded
by grader $g$
  - Draw peer grade $z_i^g \sim \mathcal{N}(s_i + b_g, \frac{1}{\tau_g})$

• For each relative peer grade $d_{ij}^g$ between the submissions
submitted by students $i$ and $j$ and graded by grader $g$
  - Draw relative peer grade $d_{ij}^g \sim \mathcal{N}(s_i - s_j, \frac{2}{\tau_g})$
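
To make the two-stage generation concrete, the following sketch simulates data from the PG6 process using only the Python standard library. The hyper-parameter defaults, the dictionary layout and the clamping of the Gamma shape are assumptions for illustration, and the variance $2/\tau_g$ for relative grades follows the reading of the generative process given above.

```python
import random

def simulate_pg6(n_students, graded_by, mu=7.0, gamma=1.0, eta=1.0,
                 beta=1.0, seed=0):
    """graded_by maps grader index -> list of student indices graded.

    Returns (s, b, tau, z, d) drawn from the PG6 generative process.
    """
    rng = random.Random(seed)
    # Stage 1: latent true grades, biases and reliabilities.
    s = [rng.gauss(mu, (1.0 / gamma) ** 0.5) for _ in range(n_students)]
    b, tau = {}, {}
    for g in graded_by:
        b[g] = rng.gauss(0.0, (1.0 / eta) ** 0.5)
        # gammavariate takes (shape, scale), and scale = 1 / rate.
        # The shape must be positive, so this sketch clamps it (assumption).
        tau[g] = rng.gammavariate(max(s[g], 0.1), 1.0 / beta)
    # Stage 2: observed peer grades and relative peer grades.
    z, d = {}, {}
    for g, students in graded_by.items():
        for i in students:
            z[(g, i)] = rng.gauss(s[i] + b[g], (1.0 / tau[g]) ** 0.5)
        for i in students:
            for j in students:
                if i != j:
                    d[(g, i, j)] = rng.gauss(s[i] - s[j],
                                             (2.0 / tau[g]) ** 0.5)
    return s, b, tau, z, d
```

Note that in the experiments the relative grades are computed from the observed peer grades; drawing them separately here only mirrors the generative story of the model.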


In the PG6 model, the posterior distribution of the true
score $s_i$ of a submission does not have a closed form. To
approximate the distribution of this latent variable, we
discretize the true score $s_i$ from 0
to 15 (the full mark of the assignment) with an interval of
0.1. The variables are updated according to Eq. 1.


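$$b_g \sim \mathcal{N}\left(\frac{\sum_{i \in S_g} \tau_g (z_i^g - s_i)}{\eta + |S_g|\tau_g},\ \frac{1}{\eta + |S_g|\tau_g}\right),$$

$$\tau_g \sim \Gamma\left(s_g + \frac{|S_g|^2}{2},\ \beta + \sum_{i \in S_g} \frac{(z_i^g - s_i - b_g)^2}{2} + \sum_{i,j \in S_g} \frac{(d_{ij}^g - (s_i - s_j))^2}{4}\right),$$

$$s_i \propto \frac{\beta^{s_i}}{\Gamma(s_i)}\, \tau_i^{s_i - 1} \exp\left(-\frac{R}{2}\left(s_i - \frac{Y}{R}\right)^2\right), \qquad (1)$$

where $R = \gamma + \sum_{g \in G_i} \frac{1}{2}\tau_g(|S_g| + 1)$, and
$Y = \gamma\mu + \sum_{g \in G_i} \tau_g \left( (z_i^g - b_g) + \frac{1}{2} \sum_{j \in S_g \setminus \{i\}} (d_{ij}^g + s_j) \right)$.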
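
The Gaussian update of the grader bias in Eq. 1 can be sketched as below. The function names and the dictionary-based inputs are hypothetical; the formula used is the conditional mean $\tau_g \sum_{i \in S_g}(z_i^g - s_i) / (\eta + |S_g|\tau_g)$ with variance $1/(\eta + |S_g|\tau_g)$.

```python
import random

def bias_conditional(z_g, s, tau_g, eta):
    """Mean and variance of the Gaussian conditional of b_g.

    z_g maps student -> grade given by this grader,
    s maps student -> current estimate of the true grade s_i.
    """
    precision = eta + len(z_g) * tau_g
    mean = tau_g * sum(z_g[i] - s[i] for i in z_g) / precision
    return mean, 1.0 / precision

def sample_bias(z_g, s, tau_g, eta, rng=random):
    """One Gibbs draw of b_g given the other variables."""
    mean, var = bias_conditional(z_g, s, tau_g, eta)
    return rng.gauss(mean, var ** 0.5)

# Hypothetical grader who is one point too harsh on both submissions.
mean, var = bias_conditional({"i": 5.0, "j": 7.0}, {"i": 6.0, "j": 8.0},
                             tau_g=2.0, eta=0.1)
```

The prior precision $\eta$ shrinks the estimate toward 0, so the conditional mean here is slightly closer to 0 than the raw average error of -1.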


4.2 The PG7 Model

The difference between the PG7 model and the PG6 model lies in
the grader reliability generation: PG6 adopts a Gamma
distribution while PG7 adopts a Gaussian distribution. The
generative process of the PG7 model is as follows:


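• For each submission submitted by student $i$
  - Draw true grade $s_i \sim \mathcal{N}(\mu, \frac{1}{\gamma})$

• For each grader $g$
  - Draw bias $b_g \sim \mathcal{N}(0, \frac{1}{\eta})$
  - Draw reliability $\tau_g \sim \mathcal{N}(s_g, \frac{1}{\beta})$

• For each peer grade $z_i^g$ of the submission submitted by student $i$ and graded
by grader $g$
  - Draw peer grade $z_i^g \sim \mathcal{N}(s_i + b_g, \frac{\lambda}{\tau_g})$

• For each relative peer grade $d_{ij}^g$ between the submissions
submitted by students $i$ and $j$ and graded by grader $g$
  - Draw relative peer grade $d_{ij}^g \sim \mathcal{N}(s_i - s_j, \frac{2\lambda}{\tau_g})$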


In this model, the posterior distribution of the reliability
$\tau_g$ of a grader does not have a closed form either, and we apply
a discrete approximation to the posterior distribution
of the grader’s reliability on a grid from 0 to 15 with an interval of
0.1. The variables are updated according to Eq. 2.


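$$b_g \sim \mathcal{N}\left(\frac{\frac{\tau_g}{\lambda} \sum_{i \in S_g} (z_i^g - s_i)}{\eta + |S_g|\frac{\tau_g}{\lambda}},\ \frac{1}{\eta + |S_g|\frac{\tau_g}{\lambda}}\right),$$

$$\tau_g \propto \tau_g^{\frac{|S_g|^2}{2}} \times \exp\left(-\frac{\beta}{2}\left[\tau_g - \left(s_g - \frac{\sum_{i \in S_g} (z_i^g - s_i - b_g)^2}{2\lambda\beta} - \frac{\sum_{i,j \in S_g} (d_{ij}^g - (s_i - s_j))^2}{4\lambda\beta}\right)\right]^2\right),$$

$$s_i \sim \mathcal{N}\left(\frac{Y}{R},\ \frac{1}{R}\right), \qquad (2)$$

where $R = \gamma + \beta + \sum_{g \in G_i} \frac{\tau_g(|S_g| + 1)}{2\lambda}$, and
$Y = \gamma\mu + \beta\tau_i + \sum_{g \in G_i} \frac{\tau_g}{\lambda} \left( (z_i^g - b_g) + \frac{1}{2} \sum_{j \in S_g \setminus \{i\}} (d_{ij}^g + s_j) \right)$.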


Table 2: Dataset Statistics

                         Question 1   Question 2   Question 3
# of graders             100          237          105
# of submissions         126          288          141
# of peer grades         493          1121         516
# of instructor grades   114          257          123
full grades              15           15           15
observed mean            6.8          6.7          6.2
observed variance        0.11         0.12         0.14


5. EXPERIMENTAL RESULTS 


We perform experiments on a real-world dataset with three 
questions to evaluate the performance of the proposed mod- 
els, and we show the results in this section. 


5.1 Dataset 


The real dataset, which includes peer grades for three questions,
was collected from a course named “Immortal Arts: Approaching
the masters and classics” on the XuetangX
platform¹. For each question, students are asked to write an
essay of between 100 and 250 words. The peer graders for each
submission are automatically assigned by the platform and
the grading process is double-blind. After receiving the peer
grades, the platform uses the median of the peer grades as the
final grade of each submission. The grades assigned by TAs
are also available in this dataset, and we use them as the ground
truth (true score) in the evaluation. The overall statistics of
this dataset are shown in Table 2.


5.2 Baselines 

In order to evaluate the effectiveness of the proposed models,
we compare them with 6 baselines: the median of peer grades,
the mean of peer grades, the PG1 and PG3 models in [7], and
the PG4 and PG5 models in [6].


• Median: This approach takes the median of peer grades
as the final grade. This is the most frequently used
method to aggregate peer grades on MOOC platforms
such as Coursera² and XuetangX.


• Mean: This approach simply assigns the mean value of
peer grades as the final grade to a submission. In some
cases, using the mean value of peer grades as the final
grade may achieve good performance according
to [9].


• PG1: This is the first probabilistic model for cardinal
peer grading estimation that considers the reliability
and bias of graders [7].


• PG3: This is a probabilistic model that links the grader’s
reliability with the grader’s own grade. This model
assumes that the variance of the distribution of the peer
grades is inversely proportional to a linear function of
the grader’s grade [7].


• PG4: This is a probabilistic model assuming that a
grader’s reliability follows a Gamma distribution with
the shape parameter equal to the grader’s own grade.
The PG6 model is an extension of this model [6].


• PG5: This is a probabilistic model assuming that a
grader’s reliability follows a Gaussian distribution with
the mean equal to the grader’s own grade. The PG7
model is an extension of this model.


¹ www.xuetangx.com
² www.coursera.org


Table 3: Experimental Results

          Question 1      Question 2      Question 3
          Mean    Std     Mean    Std     Mean    Std
Mean      1.80            2.29            2.06
Median    2.19            2.57            2.29
PG1       1.97    0.02    2.34    0.01    2.21    0.02
PG3       1.69    0.07    2.85    0.01    1.92    0.01
PG4       2.54    0.02    2.94    0.02    3.07    0.02
PG6       1.31    0.01    1.44    0.01    1.38    0.02
PG5       1.52    0.04    1.80    0.01    1.74    0.02
PG7       1.24    0.002   1.45    0.01    1.31    0.01
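
The two naive baselines, Median and Mean, amount to a per-submission aggregation of the received peer grades. A minimal sketch, with a hypothetical function name and made-up grades, could look like this:

```python
from statistics import mean, median

def aggregate(peer_grades, how="median"):
    """peer_grades maps submission -> list of received peer grades."""
    fn = median if how == "median" else mean
    return {sub: fn(grades) for sub, grades in peer_grades.items()}

# Hypothetical peer grades for two submissions.
grades = {"a": [4, 6, 11], "b": [7, 8]}
by_median = aggregate(grades)
by_mean = aggregate(grades, how="mean")
```

Submission "a" shows how the two baselines can disagree when one grader is an outlier: the median ignores the grade of 11 while the mean is pulled toward it.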


5.3 Experimental Settings

As described before, many hyper-parameters are used in 
the proposed models and baselines, and it is important to 
set reasonable values for these hyper-parameters. In this 
section, we describe how to set the values of these hyper- 
parameters in our experiment. 


Since the proposed models are extensions of the PG4
and PG5 models in [6], to evaluate the effect of leveraging
relative grades, we set the same values for the shared
hyper-parameters in the proposed models and in the PG4 and PG5
models. We use the mean and variance of the peer grades
as the mean ($\mu$) and variance ($\frac{1}{\gamma}$) of the prior distribution
of the true grade ($s_i$). As claimed in [6], $\beta$ in the PG4
model, which is the rate of the Gamma distribution for
the grader’s reliability, and $\lambda$ in the PG5 model, which
scales the variance of the Gaussian distribution for peer
grades, are the most important hyper-parameters. These
parameters have a significant influence on the performance of
these two models, while the other hyper-parameters influence the
performance only slightly if set in a reasonable range. Thus we
mainly tune $\beta$ in the PG4 and PG6 models and $\lambda$ in the PG5
and PG7 models. We search for these two hyper-parameters
in the range [50, 300] with an interval of 50 to get the
best performance. We set $\eta$ to 0.1 in our experiment, and
in the PG5 and PG7 models, $\beta$ is set to 0.1. For each
latent variable, we sample for 300 iterations, and the first 60
iterations are burn-in iterations that are discarded.
The average results over 10 runs with the hyper-parameter
settings described above are reported.
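
The sampling schedule and the search grid described above can be written down directly. The helper name is hypothetical, and the sketch only shows how the burn-in draws are dropped and which candidate values are searched.

```python
from statistics import mean

def posterior_mean(samples, burn_in=60):
    """Average the Gibbs draws of one latent variable after the burn-in."""
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    return mean(kept)

# Candidate values searched for beta (PG4, PG6) and lambda (PG5, PG7).
grid = list(range(50, 301, 50))

# Hypothetical chain of 300 draws: 60 burn-in draws, then a settled value.
estimate = posterior_mean([0.0] * 60 + [2.0] * 240)
```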


5.4 Real Dataset Performance 

We use the Root-Mean-Square Error (RMSE) to evaluate the
performance of the proposed models and baselines on the
datasets. The experimental results are shown in Table 3.
From Table 3, we can see that on all three questions
the PG6 and PG7 models outperform the baselines.
The RMSE of the PG6 and PG7 models, which incorporate
the relative observed grades to capture the dependency
between the true grades of submissions, is lower than
that of the PG4 and PG5 models. The results demonstrate
the effectiveness of incorporating relative peer grades
in cardinal peer grade estimation.
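
For reference, the RMSE metric can be computed as in the sketch below. Restricting the computation to submissions that have a ground-truth grade is an assumption made here, motivated by Table 2 listing fewer instructor grades than submissions; the function name and example values are hypothetical.

```python
import math

def rmse(estimated, truth):
    """Root-mean-square error over submissions with a ground-truth grade."""
    common = [k for k in estimated if k in truth]
    if not common:
        raise ValueError("no overlapping submissions")
    total = sum((estimated[k] - truth[k]) ** 2 for k in common)
    return math.sqrt(total / len(common))

# Hypothetical estimates; submission "c" has no instructor grade.
score = rmse({"a": 6.0, "b": 8.0, "c": 1.0}, {"a": 9.0, "b": 4.0})
```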


To better illustrate the performance of the PG6 and PG7
models, we further compare the estimated grades with the
ground truth on individual submissions in Figure 2. The
submissions are sorted in increasing order of the ground
truth. We then plot the estimated grades from Mean (the
best naive method), PG5 (the best baseline), and the proposed
PG7 model, which has the best performance. We can
see that the estimated grades of all three methods show an
increasing trend, but Mean shows a strong negative bias in
the peer grades: the peer grades are consistently lower than
the ground-truth grades. Therefore, it is important to model
the bias of graders to improve the aggregation results. PG5
and PG7 both show a positive bias compared with the ground
truth, but the bias of PG5 is a bit higher. The comparison
between PG5 and PG7 illustrates that the relative grades can
also help estimate the bias more accurately. This may imply
that although graders cannot give accurate absolute grades,
they can assign accurate relative grades.


We further compare the experimental bias estimated by the 
proposed models with the real bias. The experimental bias 
is defined as the average difference between the peer grades 
assigned by a grader and the estimated true grades. The 
real bias is defined as the average difference between the 
peer grades assigned by a grader and the ground truth. For
example, if a grader $g$ grades two submissions from students $i$
and $j$, the experimental bias of this grader is
$\frac{(z_i^g - s_i) + (z_j^g - s_j)}{2}$
and the real bias is
$\frac{(z_i^g - s_i^*) + (z_j^g - s_j^*)}{2}$,
where $s_i$ and $s_j$ are
the estimated grades, and $s_i^*$ and $s_j^*$ are the ground-truth grades
of submissions $i$ and $j$. The results are illustrated in
Figure 3, where the x-axis denotes the real bias and the y-axis denotes
the experimental bias. We can see that most graders are 
harsh graders whose real biases are less than 0. The di- 
agonal means that the estimated bias is the same as the 
real bias. The closer to the diagonal, the more accurate the 
bias estimation is. We can observe that our estimated bias 
is close to the real bias. With better bias estimation, the 
proposed models achieve more accurate cardinal estimation. 
This result again indicates the informativeness of relative 
grades in estimating final grades. 
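
The two bias quantities compared in Figure 3 share the same computation and differ only in the reference grades. A minimal sketch with hypothetical names and values:

```python
from statistics import mean

def grader_bias(z_g, reference):
    """Average of (peer grade - reference grade) for one grader.

    z_g maps student -> grade given by the grader; reference maps
    student -> estimated grade (experimental bias) or ground truth (real bias).
    """
    return mean(z_g[i] - reference[i] for i in z_g if i in reference)

z_g = {"i": 5.0, "j": 7.0}
experimental = grader_bias(z_g, {"i": 6.0, "j": 7.5})  # estimated grades
real = grader_bias(z_g, {"i": 6.5, "j": 8.0})          # ground-truth grades
```

A point on the diagonal of Figure 3 corresponds to a grader for whom these two values coincide.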


5.5 Sensitivity of Hyper-parameters 

To show how the value of the hyper-parameter $\beta$ in the PG6
model and the hyper-parameter $\lambda$ in the PG7 model
influences the performance, we conduct experiments using different
values of these two hyper-parameters with all other
hyper-parameters fixed. In this sensitivity experiment,
the settings of the fixed hyper-parameters
are the same as described above, and $\beta$ in
the PG6 model and $\lambda$ in the PG7 model are varied from
50 to 300 with an interval of 50. The results in Figure 4
show that in a reasonable range these two models are
robust to the value of the hyper-parameter and achieve acceptable
performance.


6. CONCLUSIONS AND FUTURE WORK 


With the popularity of the MOOCs, peer assessment has 
become an effective paradigm for large-scale grading. The 
aggregation of peer grades is a challenging problem due to 
the various levels of bias and reliability among graders that 
are unknown. Existing work contributes to the development 
of effective peer grading aggregation methods by modeling 
grader bias and reliability, but they ignore an important as- 
pect in peer grading aggregation, which is the dependency 
relation among grades. In these models, the relative grades
are not considered and the true grades of submissions are
modeled independently. Modeling the dependencies among
the true grades of different submissions can help improve
the robustness of the aggregated grade estimation. In this
paper, we propose two novel models that leverage relative
grades to achieve improved estimation of final grades. In
the proposed probabilistic models, we capture the distributions
of true scores based on graders’ bias and reliability
degrees as well as their own submission scores, which represent
their knowledge about the question. In addition, the
proposed models couple the true scores of different submissions
via their differences. Effective inference algorithms are
proposed to infer both model parameters and final scores.
Experimental results demonstrate that the proposed models
improve the accuracy of cardinal peer grading estimation
and that the relative peer grades among
submissions indeed contribute to this improvement.


Figure 2: The estimated grades of three questions using Mean, the PG5 and PG7 models, and the ground truth. The submissions are sorted by their ground truth.

Figure 3: The comparison of experimental bias

Figure 4: Hyper-parameter sensitivity of the PG6 and PG7 models (panels: hyper-parameter sensitivity of PG6; hyper-parameter sensitivity of PG7).


In the future, we will investigate how to better model the 
ability of graders reflecting both reliability and bias of graders 
and how to cluster the graders and submissions into different 
groups to improve the peer assessment. 